Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition

Authors

Abstract

Video-based person recognition is challenging because persons may be occluded or blurred and the shooting angle varies. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition. FAMF aggregates face features and combines them with multi-modal information to identify persons in videos. For frame aggregation, we propose a trainable layer based on NetVLAD, named AttentionVLAD. It takes an arbitrary number of features as input and computes a fixed-length aggregated feature based on feature quality. We show that introducing an attention mechanism into NetVLAD effectively decreases the impact of low-quality frames. For the multi-modal information in videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module that learns the correlation between modalities by adaptively updating a Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
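The frame-aggregation idea can be illustrated with a small pure-Python sketch. This is not the authors' AttentionVLAD layer. It rests on three assumptions:

- Each frame already has a scalar quality score. How such scores are produced, and the names `attention_vlad`, `quality` and `alpha`, are hypothetical.
- A softmax turns the scores into frame-level attention weights.
- Those weights scale the standard NetVLAD soft-assignment residuals.

As a result, any number of frames maps to a fixed K×D descriptor, where K is the number of cluster centers and D the feature dimension.

```python
import math
from typing import List, Sequence

Vector = List[float]


def _l2_normalize(v: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale a vector to unit L2 norm (returned unchanged if it is ~zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > eps else list(v)


def _softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_vlad(
    features: Sequence[Sequence[float]],
    quality: Sequence[float],
    centers: Sequence[Sequence[float]],
    alpha: float = 10.0,
) -> Vector:
    """Aggregate N frame features (dim D) into one K*D descriptor.

    Hypothetical sketch: `quality` holds one unnormalized score per frame
    (its source is an assumption, e.g. a small learned scorer). A softmax
    turns the scores into attention weights, so low-quality frames
    contribute less to the NetVLAD residual sums.
    """
    if not features:
        raise ValueError("need at least one frame")
    if len(quality) != len(features):
        raise ValueError("one quality score per frame is required")
    dim = len(centers[0])
    attn = _softmax(quality)  # frame-level attention weights, summing to 1
    vlad = [[0.0] * dim for _ in centers]  # one residual accumulator per cluster

    for x, w in zip(features, attn):
        # NetVLAD-style soft assignment: softmax over -alpha * ||x - c_k||^2
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        assign = _softmax([-alpha * d for d in dists])
        for k, c in enumerate(centers):
            for j in range(dim):
                # attention-weighted, assignment-weighted residual
                vlad[k][j] += w * assign[k] * (x[j] - c[j])

    # Intra-normalize each cluster block, flatten, then L2-normalize the result.
    flat = [v for block in vlad for v in _l2_normalize(block)]
    return _l2_normalize(flat)


if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]               # K = 2 clusters, D = 2
    frames = [[0.1, 0.0], [0.9, 1.2], [5.0, -3.0]]   # last frame is an outlier
    scores = [2.0, 2.0, -4.0]                        # low score for the outlier
    desc = attention_vlad(frames, scores, centers)
    print(len(desc))  # 4 == K * D, independent of the number of frames
```

The cluster centers fix the output size, so a track of 3 frames and a track of 300 frames both yield a K·D vector. In a trained model, the centers, `alpha` and the quality scorer would be learned rather than hand-set as they are here.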

Similar resources

Tight Frame Approximation for Multi-Frames and Super-Frames

This thesis studies a generator for multi-frames or super-frames generated under the action of a projective unitary representation of countable discrete groups. Examples of such frames include Gabor multi-frames, Gabor super-frames, and frames for shift-invariant subspaces. We show that there exists a unique normalized tight multi-frame (super-frame) generator at minimum distance from the given generator. Similar problems are also considered for dual frames, and some ...

Multi-modal Person Recognition for Vehicular Applications

In this paper, we present biometric person recognition experiments in a real-world car environment using speech, face, and driving signals. We have performed experiments on a subset of the in-car CIAIR corpus collected at the Nagoya University, Japan. We have used Mel-frequency cepstral coefficients (MFCC) for speaker recognition. For face recognition, we have reduced the feature dimension of e...

Multi-modal Aggregation for Video Classification

In this paper, we present a solution to the Large-Scale Video Classification Challenge (LSVC2017) [1] that ranked 1st place. We focused on a variety of modalities that cover visual, motion and audio. Also, we visualized the aggregation process to better understand how each modality takes effect. Among the extracted modalities, we found Temporal-Spatial features calculated by 3D convolution quit...

Fusion of audio and video information for multi modal person authentication

Multi-modal analysis for person type classification in news video

Classifying the identities of people appearing in broadcast news video into anchor, reporter, or news subject is an important topic in high-level video analysis. Given the visual resemblance of different types of people, this work explores multi-modal features derived from a variety of evidence, such as the speech identity, transcript clues, temporal video structure, named entities, and uses a...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-67832-6_7